Abstract

Background: A myriad of methods to reverse-engineer transcriptional regulatory networks have been developed in 
recent years. Direct methods directly reconstruct a network of pairwise regulatory interactions while module-based 
methods predict a set of regulators for modules of coexpressed genes treated as a single unit. To date, there has 
been no systematic comparison of the relative strengths and weaknesses of both types of methods. 
Results: We have compared a recently developed module-based algorithm, LeMoNe (Learning Module Networks), 
to a mutual information based direct algorithm, CLR (Context Likelihood of Relatedness), using benchmark 
expression data and databases of known transcriptional regulatory interactions for Escherichia coli and Saccha- 
romyces cerevisiae. A global comparison using recall versus precision curves hides the topologically distinct nature 
of the inferred networks and is not informative about the specific subtasks for which each method is most suited. 
Analysis of the degree distributions and a regulator specific comparison show that CLR is 'regulator-centric', 
making true predictions for a higher number of regulators, while LeMoNe is 'target-centric', recovering a higher 
number of known targets for fewer regulators, with limited overlap in the predicted interactions between both 
methods. Detailed biological examples in E. coli and S. cerevisiae are used to illustrate these differences and 
to prove that each method is able to infer parts of the network where the other fails. Biological validation of 
the inferred networks cautions against over-interpreting recall and precision values computed using incomplete 
reference networks. 

Conclusions: Our results indicate that module-based and direct methods retrieve largely distinct parts of the 
underlying transcriptional regulatory networks. The choice of algorithm should therefore be based on the particular 
biological problem of interest and not on global metrics which cannot be transferred between organisms. The 
development of sound statistical methods for integrating the predictions of different reverse-engineering strategies 
emerges as an important challenge for future research. 



Background 

Due to the success of microarray technology, the 
available data on the transcriptional regulatory net- 
works of different organisms has grown exponen- 
tially. In order to explore these data to the max- 
imum, a myriad of methods to reverse-engineer or 
reconstruct transcriptional regulatory networks from 
microarray data have been developed in the past 
few years. In general, the scientific community has 
mainly focused on the overall performance of newly 
developed methods in reconstructing the known net- 
work of certain model organisms as compared to 
a reference network, measuring algorithmic perfor- 
mance with standard measures such as recall and 
precision. Less attention has been paid to the extent
to which conceptually different approaches differ in the
networks they infer. Nonetheless, in order to get
a better understanding of the systems studied it is 
also important to understand which specific prob- 
lems can be tackled using a certain method, irre- 
spective of the overall performance of the different 
methods. 

Broadly speaking we can distinguish between two 
classes of methods for reverse-engineering transcrip- 
tional regulatory networks from gene expression data 
which differ vastly in how they approach the network 
inference problem. Direct methods infer individual 
regulator-target interactions using a pairwise corre- 
lation measure between the expression profiles of a 
transcription factor and its putative targets [1,2]. 
Module-based methods assume a modular structure 
of the transcriptional regulatory network [3-5], with
genes subject to the same regulatory input being or- 
ganized in coexpression modules. 

While different direct methods have been com- 
pared to each other in the past [2,6,7], no system- 
atic comparison between direct and module-based 
methods has been undertaken so far. In this study 
we perform such a comparison using a representa- 
tive method from each class. The CLR (Context 
Likelihood of Relatedness) algorithm [2] considers all 
possible pairwise regulator-target interactions and 
scores these interactions based on the mutual infor- 
mation of their expression profiles as compared to 
an interaction specific background distribution. It 
has been shown to outperform other direct meth- 
ods [2]. The LeMoNe (Learning Module Networks) 
algorithm [8] uses probabilistic, ensemble-based op- 
timization techniques [8,9] to infer high-quality mod- 
ule networks [3], where genes are first partitioned 
into coexpression modules and regulators are
assigned to modules based on how well they explain
the condition-dependent expression behavior of the 
module. It has been shown to outperform the origi- 
nal module network algorithm [8]. 
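
To make the idea of an interaction-specific background more concrete, the following minimal Python sketch scores regulator-target pairs in the spirit of CLR. It assumes a precomputed regulator-by-target mutual information matrix, and it combines the two background z-scores as sqrt(z_reg^2 + z_tgt^2) with negative z-scores clipped to zero, which is our reading of the CLR publication [2]. The function name `clr_scores` and the use of matrix rows and columns as backgrounds are illustrative assumptions, not the reference implementation.

```python
import math
from statistics import mean, pstdev

def clr_scores(mi):
    """mi[r][t]: mutual information between regulator r and target t."""
    n_reg, n_tgt = len(mi), len(mi[0])
    # Background of each regulator: its row; background of each target: its column.
    row_stats = [(mean(row), pstdev(row)) for row in mi]
    cols = [[mi[r][t] for r in range(n_reg)] for t in range(n_tgt)]
    col_stats = [(mean(col), pstdev(col)) for col in cols]

    def z(value, mu, sigma):
        # Negative z-scores are clipped to zero; a flat background gives zero.
        return max(0.0, (value - mu) / sigma) if sigma > 0 else 0.0

    scores = [[0.0] * n_tgt for _ in range(n_reg)]
    for r in range(n_reg):
        for t in range(n_tgt):
            z_r = z(mi[r][t], *row_stats[r])
            z_t = z(mi[r][t], *col_stats[t])
            scores[r][t] = math.sqrt(z_r ** 2 + z_t ** 2)
    return scores
```

Because each pair is judged against the backgrounds of its own regulator and target, a moderately informative pair can still score well when both backgrounds are weak.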

We have compared both methods at increas- 
ing levels of detail using public expression compen- 
dia for Escherichia coli [2] and Saccharomyces cere- 
visiae [10], two organisms for which relatively large 
databases of known transcriptional regulatory in- 
teractions exist [11,12]. We first use recall versus 
precision curves to give a comparison of the global 
performance of both methods. We then show that 
due to the different assumptions underlying both 
methodologies, they infer topologically distinct net- 
works with limited overlap, even at equal perfor- 
mance thresholds. To understand these distinctions 
more completely, we examined in detail example sub- 
systems of the network which are well characterized, 
namely the chemotaxis and flagellar system in E. coli 
and a respiratory module and a membrane lipid and 
fatty acid metabolism module in S. cerevisiae. Bi- 
ological validation of the inferred networks cautions 
against over-interpreting recall and precision values 
computed using incomplete reference networks. 



Results and Discussion 

Global comparison using recall and precision 

The output of LeMoNe is a ranked list of regulator-module
interactions and the output of CLR is a ranked list of
regulator-target interactions, each scored according to
its statistical significance. As a first, global, comparison,
we can therefore compute recall and precision
with respect to the given reference networks
at different score cutoffs. For CLR we can directly 
compare the inferred network with the true network; 
for LeMoNe we draw an edge between each regulator 
assigned to a module and all genes in the module, 
thereby ignoring at this stage the extra information 
present in the module structure. We computed re- 
call and precision as in [2]: if an edge is predicted 
between two genes present but unconnected in the 
reference network it is counted as a false positive. 
Figure 1 shows the recall versus precision curves for
both algorithms and both organisms. 
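
The evaluation described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used for the figures: edges are (regulator, target) tuples, the function names are hypothetical, and "present in the reference network" is interpreted as the regulator occurring as a regulator and the target occurring as a target in the reference.

```python
def expand_modules(assignments, modules):
    """Turn (regulator, module_id) pairs into (regulator, gene) edges."""
    return [(reg, gene) for reg, mod in assignments for gene in modules[mod]]

def recall_precision(ranked_edges, reference, cutoff):
    """Recall and precision of the top `cutoff` edges of a ranked list."""
    ref_regs = {r for r, _ in reference}
    ref_tgts = {t for _, t in reference}
    # Only edges whose regulator and target both occur in the reference are
    # evaluated; such an edge is a false positive if it is not a reference edge.
    evaluated = {(r, t) for r, t in ranked_edges[:cutoff]
                 if r in ref_regs and t in ref_tgts}
    tp = len(evaluated & reference)
    fp = len(evaluated) - tp
    recall = tp / len(reference) if reference else 0.0
    precision = tp / (tp + fp) if evaluated else 0.0
    return recall, precision
```

Sweeping `cutoff` over the ranked list yields one point of the recall versus precision curve per cutoff value.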

Both algorithms successfully prioritize true positive
interactions, especially in E. coli: all curves
go from a high precision, low recall region to a low
precision, high recall region. For CLR the curves
show a smooth course while for LeMoNe they are
more staircase-like. CLR scores individual interactions
and as a result, in the recall-precision curve
interactions will be added one by one, but interactions
corresponding to a certain regulator will be dispersed
continuously throughout the recall-precision
curve. LeMoNe on the other hand assigns a regulator
to a module as a whole and all targets belonging
to the same module are added at the same time in
the recall-precision curve. For a stringent threshold
and consequently a low number of inferred interactions,
the CLR network will cover few interactions
for many regulators while the LeMoNe network will
retrieve many interactions for few regulators.

Figure 1: Recall versus precision curves for LeMoNe (red) and CLR (blue) for E. coli (a) and S. cerevisiae
(b). Note the difference in scale between both organisms.

At similar levels of precision, the recall in S. cere- 
visiae is nearly an order of magnitude smaller than 
in E. coli, in line with previous studies [13]. This
is likely due to the higher complexity of transcrip- 
tional regulation in S. cerevisiae with a higher degree 
of combinatorial regulation and posttranscriptional 
control, and consequently a lower degree of correla- 
tion in expression between transcription factors and 
their targets. 

A simple 'area under the curve' measurement 
would suggest that CLR performs slightly better in 
the prokaryote E. coli and LeMoNe in the eukaryote 
S. cerevisiae. However, as we will show below, both 
algorithms infer complementary information in both
organisms.



Topological distinctions between inferred networks 

As explained in the previous section, due to how 
interactions are scored, direct and module-based 
methods will infer different kinds of networks at 
stringent precision thresholds. For E. coli, we compared
the LeMoNe and CLR networks at a 30% precision
threshold where both networks have nearly
equal recall and precision (see Figure 1). The
LeMoNe network consists of 53 regulators assigned 
to 62 modules for a total of 1079 predicted inter- 
actions; 594 of these interactions are between genes 
in RegulonDB, with a precision of 29%. The corresponding
CLR network contains 1422 predicted interactions
for 242 regulators; 597 of these interactions
are between genes in RegulonDB, with a precision
of 30%. Of the 53 LeMoNe regulators, 51 are
also present in the CLR network, but only 277 interactions
are predicted in both networks. For S. cerevisiae,
there is no 'natural' point on the recall versus
precision curve to compare both networks. We
therefore compared CLR and LeMoNe at the first 
1070 predicted interactions. This number is chosen 
to give comparably sized networks as in E. coli and 
ensure that the ranked list of LeMoNe interactions is
not cut off in the middle of one module. The cutoff
of the first 1070 interactions corresponds to precision
values of respectively 16% and 10% for LeMoNe and
CLR (cf. Figure 1). The LeMoNe network consists
of 34 regulators assigned to 39 modules containing
867 genes, while the CLR network contains 214 regulators;
28 regulators are present in both networks,
yet only 75 interactions are common.

Figure 2: (a) E. coli in-degree distribution for LeMoNe (red) and CLR (blue) at 30% precision threshold,
(b) E. coli out-degree distribution for LeMoNe (red) and CLR (blue) at 30% precision threshold, (c) S.
cerevisiae in-degree distribution for LeMoNe (red) and CLR (blue) at first 1070 predictions, (d) S. cerevisiae
out-degree distribution for LeMoNe (red) and CLR (blue) at first 1070 predictions.
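
The overlap statistics quoted in this section reduce to simple set operations on the two edge lists. The sketch below is illustrative only; the function name `compare_networks` is hypothetical and the Jaccard index is an extra summary that is not reported in this study.

```python
def compare_networks(edges_a, edges_b):
    """Count shared regulators and shared (regulator, target) edges."""
    a, b = set(edges_a), set(edges_b)
    regs_a = {reg for reg, _ in a}
    regs_b = {reg for reg, _ in b}
    union = a | b
    return {
        "shared_regulators": len(regs_a & regs_b),
        "shared_edges": len(a & b),
        # Fraction of all predicted edges that both methods agree on.
        "jaccard_edges": len(a & b) / len(union) if union else 0.0,
    }
```

If the 277 shared E. coli interactions are counted among all 1079 and 1422 predicted edges, the corresponding Jaccard index would be 277/2224, roughly 0.12.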

The networks inferred by LeMoNe and CLR are 
topologically very distinct (see Supplementary Figures
S1 to S4). This distinction can be quantified
by their in- and out-degree distributions (Figure 2).
The in-degree is the number of regulators assigned to
a certain target gene and the in-degree distribution 
counts for each value k the number of targets with 
in-degree k. Likewise, the out-degree is the num- 
ber of targets assigned to a certain regulator and 
the out-degree distribution counts for each value k 
the number of regulators with out-degree k. CLR 
infers for each regulator only the most significant 
targets. As a result, the out-degree distribution is
concentrated at low values, with the majority of regulators
having only a few targets. The in-degree distribution
on the other hand has a long tail of genes assigned
to many different regulators. LeMoNe infers for each
module the most significant regulators, resulting in 
opposite characteristics of the degree distributions. 
The in-degree distribution has no tail since for most 
modules at most 2 significant regulators are identi- 
fied. The out-degree distribution on the other hand 
has a long tail since each regulator assignment in- 
volves a whole module of genes. For these reasons, 
we say that CLR is 'regulator-centric' and LeMoNe 
is 'target-centric'. 
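
The degree distributions defined above can be computed directly from a list of (regulator, target) edges. The following is a minimal sketch; the function name is hypothetical and the edge-tuple representation is an assumption.

```python
from collections import Counter

def degree_distributions(edges):
    """Return (in_dist, out_dist), mapping degree k to the number of nodes with that degree."""
    edges = set(edges)  # ignore duplicate edges
    out_degree = Counter(reg for reg, _ in edges)  # targets per regulator
    in_degree = Counter(tgt for _, tgt in edges)   # regulators per target
    in_dist = Counter(in_degree.values())
    out_dist = Counter(out_degree.values())
    return dict(in_dist), dict(out_dist)
```

A regulator-centric network in the sense used here would show most of the out-degree mass at small k, whereas a target-centric network would show it in the in-degree distribution.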



Regulator specific comparison 

We make a further comparison of the two methods, 
focusing on how they differ in the type of regulators 
they assign. We again compared the 30% precision
networks for E. coli and the networks of the first 1070
interactions for S. cerevisiae. 

For both methods, a large fraction of the regu- 
lators for which known targets are inferred are au- 
toregulators. For E. coli, LeMoNe and CLR have re- 
spectively 19 and 32 regulators with at least one true 
positive; 15/19 (79%) and 27/32 (84%) are known 
autoregulators, while the fraction of autoregulators 
in the total reference network is 95/150 (63%). For 
S. cerevisiae, LeMoNe and CLR have respectively 
6 and 10 regulators with at least one true positive; 
5/6 (83%) and 5/10 (50%) are known autoregula- 
tors, while the fraction of autoregulators in the total 
reference network is 79/171 (46%). The abundance 
of autoregulators is not surprising since autoregula- 
tion is a simple mechanism by which the expression 
profile of a regulator and its targets can be corre- 
lated. 

In LeMoNe, we get as additional information 
whether a predicted regulator is positively or neg- 
atively correlated with its target module and Reg- 
ulonDB, the reference network for E. coli, contains 
the activation or repression sign for many interac- 
tions. However, although theoretically possible, we 
could not detect biologically relevant patterns of an- 
ticorrelation, in line with previous studies [14]. Even 
though the assumption of anticorrelation seems intuitively
plausible in the case of repressors, it is too
simplistic a representation of reality. Indeed LeMoNe
and CLR both find many targets of mainly autore- 
pressors (e.g. LexA, PurR, LldR and GalS), but 
they all were positively instead of negatively corre- 
lated with their targets. This can be explained by 
the fact that the activity of such autorepressors is 
dependent upon the presence of corepressing signals.
In the absence of the corepressing signal the repressor
is active, limiting its own production as well as
that of its target genes. In the presence of the corepressing
signal the repressors are inactive, which enables
the production of both the inactive repressor and
its targets [15-17].

In E. coli, regulators for which the module-based 
and direct methods differ in performance are in line 
with the topological distinctions. CLR is better at 
inferring interactions for regulators that are known 
to regulate just one or a few operons (e.g. BetI,
CsgD, DnaA, MarA, YhhG, see Figure 3). These
operons are found with a relatively high rank in 
the CLR network since their regulators often belong 
themselves to the operons and are thus by definition 
tightly coexpressed with their targets. The cluster- 
ing method employed by LeMoNe appears to be too 
coarse grained to identify these operons individu- 
ally, since they are mostly part of larger clusters. 
LeMoNe on the other hand is superior at inferring 
interactions for regulators that are known to regulate 
larger regulons, such as Fis, LexA, PurR, and RpoS, 
for which the level of coexpression is not as high as 
the one observed within a single operon (see Figure
3). In S. cerevisiae, there is no operonic structure
and hence the 'operon regulators' accurately identified
by CLR are absent. Figure 4 shows however
that the regulators for which LeMoNe and CLR infer 
known targets are still very distinct, but there ap- 
pears to be no general biological reason underlying 
these differences. 
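
The per-regulator comparison summarized in Figures 3 and 4 amounts to counting true positives per regulator in each network and sorting regulators by the difference. A minimal sketch follows, with hypothetical function names and edges again represented as (regulator, target) tuples.

```python
from collections import Counter

def true_positives_per_regulator(edges, reference):
    """Number of predicted edges per regulator that occur in the reference."""
    return Counter(reg for reg, tgt in set(edges) if (reg, tgt) in reference)

def rank_by_tp_difference(edges_a, edges_b, reference):
    """Rows (regulator, TP_a, TP_b) sorted by decreasing TP_a - TP_b."""
    tp_a = true_positives_per_regulator(edges_a, reference)
    tp_b = true_positives_per_regulator(edges_b, reference)
    regs = set(tp_a) | set(tp_b)  # regulators with at least one true positive
    rows = [(reg, tp_a[reg], tp_b[reg]) for reg in regs]
    # Ties are broken alphabetically to keep the ordering deterministic.
    return sorted(rows, key=lambda row: (row[2] - row[1], row[0]))
```

Regulators at the top of the returned list are those for which the first network recovers more known targets, and those at the bottom favor the second.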



Biological validation of inferred networks 

Due to the lack of a negative gold standard, we have 
denoted in the previous analysis an edge as being 
false positive if both regulator and target are present 
but not connected in the reference network (the pos- 
itive gold standard). Since the coverage of these ref- 
erence networks is still very incomplete, it is likely 
that the number of false positives is overestimated. 
Moreover, about half of the regulators in E. coli and 
S. cerevisiae are not present in the reference net- 
work and their predicted interactions are thus never 
evaluated. 

Figure 3: For each regulator in E. coli with known interactions inferred: (a) the number of interactions in
the reference network (green) and the number of true positives in LeMoNe (red) and CLR (blue); (b) the
number of interactions inferred (green) and the number of true positives (red) in LeMoNe, and the number
of interactions inferred (yellow) and the number of true positives (blue) in CLR. LeMoNe and CLR networks
are both at 30% precision threshold. Regulators are sorted by the difference TP_LeMoNe - TP_CLR. The total
number of true positives is 171 for LeMoNe and 180 for CLR. For clarity, the x-axis in (a) is truncated; the
true number of targets for Fis and Fnr is respectively 111 and 173. The number of interactions inferred only
counts targets that belong to the reference network.

In [2], it was already shown that new predictions
made by CLR in E. coli could be validated experimentally.
Here we have performed an in-depth biological
validation of the 30% precision module network
inferred by LeMoNe. To biologically validate
the obtained regulator-module assignments, we calculated
for all modules functional enrichment scores
[18] and enrichment in targets of previously annotated
regulators [11]. Table 1 shows that in nearly
all cases the module is enriched in known targets
of the predicted regulator (column 4) or at least involved
in the same biological function (column 6). In
several cases the predicted regulator is the one which 
has the best target enrichment value. Nearly half 
of the regulators are putative regulators without any 
currently known targets, and these assignments cannot
be validated. However, many of the correctly
predicted regulators involve neighbor regulators [19]
(Table 1, column 7), i.e. regulators colocalized with
their targets on the genome. It has been suggested
that many of the putative regulators in E. coli constitute
such neighbor regulators [20]. Hence this feature
of gene neighborhood can be used to attach additional
significance to the high-scoring predictions
for uncharacterized regulators. One of the advantages
of a module-based approach is the fact that
if a certain module contains several known targets
of the assigned regulators, the rest of the unknown
targets in this module can be considered high-confidence
predictions for that regulator. This is illustrated
in Supplementary Table S1, where we list several
predictions for 10 different modules which could
be confirmed by a thorough literature search.

Figure 4: For each regulator in S. cerevisiae with known interactions inferred: (a) the number of interactions
in the reference network (green) and the number of true positives in LeMoNe (red) and CLR (blue); (b) the
number of interactions inferred (green) and the number of true positives (red) in LeMoNe, and the number
of interactions inferred (yellow) and the number of true positives (blue) in CLR. LeMoNe and CLR networks
are both cut off at the first 1070 predictions. Regulators are sorted by the difference TP_LeMoNe - TP_CLR. The
total number of true positives is 40 for LeMoNe and 31 for CLR. For clarity, the x-axis in (a) is truncated;
the true number of targets for GCN4 is 120. The number of interactions inferred only counts targets that
belong to the reference network.

Module network predictions in S. cerevisiae have 
been experimentally validated in [3] and functionally 
analysed in [3,8]. For further validation we com- 
pared the CLR and LeMoNe networks to the YEAS- 
TRACT database [21]. This database contains most 
of the interactions in the reference network we use 
here [12]. In addition it contains targets inferred
by transcription factor deletion microarray experiments.
The number of true positives for the LeMoNe
network cut off at the first 1070 predictions increases 
from 40 (precision 16%) in the reference network to 
55 (precision 24%) with respect to YEASTRACT. 
For the CLR network cut off at the first 1070 pre- 
dictions, the number of true positives increases from 
31 (precision 10%) in the reference network to 48 
(precision 12%) with respect to YEASTRACT. 

Biological validation of inferred networks is te- 
dious and does not provide an easy alternative to 
the automatic estimation of true and false positives 
using an established reference network. The results 
of this section do show however that many 'false pos- 
itives' with respect to an incomplete network are ac- 
tually true positives when additional information is 
taken into account and that recall versus precision 
plots such as in Figure 1 have to be interpreted with
caution. 

The chemotaxis and flagellar system in Escherichia coli

Our analysis has shown that at equal levels of recall
and precision, LeMoNe predicts interactions for
fewer regulators but with higher coverage per regulator
while CLR predicts fewer interactions per regulator
but for more regulators. It is instructive to
analyse in detail the implications of these differences
for subsystems of the transcriptional regulatory network
which are particularly well perturbed in the
data set. For E. coli, we have taken a closer look at
the chemotaxis and flagellar system which forms a
complex and tightly regulated system. It consists of
the class 1 master operon flhDC, 8 class 2 operons
activated by the complex FlhDC, and at least 6 class
3 operons activated by the sigma factor FliA (Figure
5(a)). The fliA operon belongs to class 2, positively
regulates its own production and can activate other
class 2 operons as well [22].

Figure 5: (a) Operons encoding the proteins of the chemotaxis and flagellar system in E. coli. The underlined
genes belong to operons activated by FlhDC but have additional promoters activated by FliA. They
are expressed partially as class 2 genes and fully as class 3 genes. Table and data after [22]. Genes belonging
to module 12 are indicated in red, to module 18 in green, to module 24 in blue and to module 45 in magenta.
(b) Pairwise clustering frequencies in the LeMoNe clustering ensemble [8, 9] for the flagella genes. Each
row/column corresponds to a gene in one of the flagella modules and the heat map value at position (i,j)
is the frequency with which gene i and j cluster together. The blocks along the diagonal correspond to
respectively module 12, 18, 24 and 45. In module 24, it can be seen that the coclustering frequencies of flhD
with the other members is rather low, indicating a weaker degree of coexpression. See also Supplementary
Figure S5.



Four modules (12, 18, 24 and 45) in the module 
network are enriched in flagellar functions. Together 
they contain 60 genes of which 55 are known flagel- 
lar genes. The separation of flagellar genes in differ- 
ent modules is strongly supported by the LeMoNe 
clustering (Figure 5(b)), suggesting the presence
of condition-specific regulation in the flagellar gene
network, and corresponds to the difference in regulatory
input between different classes of flagellar genes
(Figure 5, see also Supplementary Figure S5). In the
30% precision LeMoNe network, FliA is assigned to 
all four modules and FlhC is correctly assigned to the 
class 2 modules 18 and 24 only. FlhD is not assigned 
with a score high enough to make the threshold. 
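
The pairwise clustering frequencies shown in Figure 5(b) can be illustrated with a short sketch: given an ensemble of clusterings, count for every gene pair the fraction of clusterings in which both genes share a cluster. The representation of a clustering as a gene-to-label dictionary and the function name are assumptions made for this example, not a description of the LeMoNe software.

```python
from itertools import combinations

def coclustering_frequencies(ensemble, genes):
    """ensemble: list of clusterings, each a dict mapping gene -> cluster label."""
    counts = {pair: 0 for pair in combinations(sorted(genes), 2)}
    for clustering in ensemble:
        for g, h in counts:
            # Count the pair when both genes carry the same cluster label.
            if clustering[g] == clustering[h]:
                counts[(g, h)] += 1
    return {pair: c / len(ensemble) for pair, c in counts.items()}
```

Blocks of high frequency along the diagonal of the resulting matrix correspond to stable modules, while a gene with low frequencies to all members of its module is only weakly coexpressed with them.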

At the 30% precision cutoff, LeMoNe and CLR
agree for the majority of predicted interactions for
FliA and FlhC. In addition, CLR infers several correct
targets for FlhD. The coexpression of FlhD with
its predicted targets is significantly lower than for
FliA or FlhC. This is evident for instance from the
LeMoNe clustering (Figure 5(b)) or CLR mutual information
values (data not shown). However, due to
the regulator-centric viewpoint and the 'local' back- 
ground correction method of CLR, these relatively 
weakly coexpressed targets still get a significant mutual
information z-score and are thus part of the predicted
network. In the target-centric LeMoNe network,
the potential assignment of FlhD to the flagella
modules is compared to the much better scoring
assignments of FliA and/or FlhC and therefore
not deemed significant enough. Hence the regulator-centric
CLR approach has the advantage of identifying
significant targets for all three flagellar regulators,
but does not distinguish well between regulation
by FlhDC and FliA due to the large overlap in
predicted targets. The target-centric LeMoNe approach
on the other hand has the advantage of inferring
detailed condition-specific regulatory information
through the division of the flagellar genes into distinct
modules, but only infers targets for FliA and FlhC.

The respiratory module and membrane lipid and fatty acid metabolism module in Saccharomyces cerevisiae

Despite the overall low performance on S. cerevisiae, 
LeMoNe and CLR both achieve good results on 
particular subsystems. The advantage of a target- 
centric approach is well exhibited by the respiratory 
system. This system is well perturbed in the data set 
and clusters of respiratory genes are found repeat- 
edly in it using various approaches [3,8,23]. LeMoNe 
module 7 contains 30 genes of which 23 are known 
respiratory genes. Hap4, a global regulator of respi- 
ratory genes, is the most significant regulator for this 
module and indeed 25 of its genes are known Hap4 
targets. The pairwise correlation between Hap4 and 
its targets varies, and since CLR scores all inter- 
actions individually, they are dispersed throughout 
the ranked list of interactions. As a result, there are 
only 12 predicted Hap4 targets (7 TP) in the first 
1070 CLR interactions (see also Figure 4). Clearly,
the preliminary step of clustering genes into target 
modules was necessary here to infer the complete 
Hap4 regulated module. 

Another interesting example is given by LeMoNe
module 11, a module of 47 genes involved in membrane
lipid and fatty acid metabolism. The four
highest-ranked regulators by LeMoNe for this module
(Gat1, Met28, Met32 and Dal80) all have known
targets in it. However, due to how regulators are
scored in LeMoNe, there are rarely more than two
significant regulators per module (see Figure 2(a)
and (c)), and only the assignments of Gat1 (3 TP)
and Met28 (4 TP) are present in the network of the
first 1070 LeMoNe interactions. CLR on the other
hand finds the most significant targets for each regulator
individually and thus identifies correct targets
from module 11 for the other regulators as well:
Met28 (1 TP), Met32 (6 TP) and Dal80 (6 TP).
For Gat1, CLR does not find true positives; however,
it finds 5 TP in module 11 for a fifth regulator,
Gln3. Hence for this module, the most complete information
is retrieved by combining the output of
LeMoNe and CLR. The genes and predicted regulators
of module 11 are mostly involved in 2 pathways,
the methionine pathway (regulated by Met28
and Met32) and the nitrogen catabolite repression
(NCR) system (regulated by Gat1, Dal80 and Gln3).
Module 11 is overexpressed in nitrogen depletion and 
amino acid starvation conditions. For NCR-sensitive 
genes it is known that they are not activated when 
rich nitrogen sources are available, but get expressed 
when only poor sources are left. A link between the 
methionine pathway and nitrogen depletion, as pre- 
dicted by LeMoNe through the clustering and by 
CLR through the assignment of common targets to 
these regulators, is not evident but appears to be 
confirmed by an ongoing study [24]. 

Figure 6: LeMoNe module 11 with genes (bottom) and predicted regulators (top) involved in the methionine
pathway (regulated by Met28 and Met32) and the nitrogen catabolite repression system (regulated by Gat1,
Dal80 and Gln3). The regulators are, from top to bottom: Gat1, predicted by LeMoNe, 1 target predicted by
CLR; Dal80, 11 targets predicted by CLR; Gln3, 6 targets predicted by CLR; Met28, predicted by LeMoNe,
4 targets predicted by CLR; Met32, 23 targets predicted by CLR. The upregulated (green) conditions are
all amino acid starvation or nitrogen depletion conditions.

Conclusion

In recent years, a wide variety of methods to reverse-engineer
transcriptional regulatory networks from
microarray data have been developed. Whereas the
development of a new method mostly coincides with
a comparison in overall performance to all existing
methods, so far no in-depth study of how conceptual
differences relate to differences in the inferred
networks has been made. Here we distinguished between
two main approaches for reverse-engineering
transcriptional regulatory networks: the module-based
approach and the direct approach. We compared
a representative algorithm of each approach
(module-based LeMoNe versus direct CLR) at several
levels of detail for two different organisms, the
prokaryote E. coli and the eukaryote S. cerevisiae.
We have found that CLR is 'regulator-centric', making
few but highly significant predictions for a large
number of regulators. LeMoNe on the other hand 
is 'target-centric', identifying few but highly signifi- 
cant regulators for a large number of genes grouped 
in coexpression modules. Through a regulator spe- 
cific comparison and analysis of specific biological 
subsystems, we have shown that at stringent signif- 
icance cutoffs, the conceptual differences in statis- 
tically scoring potential regulatory interactions lead 
to topologically distinct inferred networks containing 
different kinds of regulators and biological informa- 
tion. Our results show that the choice of algorithm 
should be made primarily based on whether the bio- 
logical question under study falls within the target- 
centric or regulator-centric viewpoint, and not on 
global metrics which cannot be transferred between 
organisms. Ideally, several network inference strate- 
gies should be combined for the best overall perfor- 
mance. It is an important challenge for future re- 
search to develop sound statistical methods for op- 
timally combining the output of multiple, existing 
reverse-engineering algorithms. 



Methods 

The E. coli microarray data compendium [2] contains
expression profiles for 4345 genes under 189
different stress conditions and genetic perturbations. 
We selected a subset of 1882 differentially expressed 
genes (standard deviation larger than 0.5) and used 
a list of 316 known or putative transcription factors 
[11,18] to reconstruct regulatory networks. LeMoNe 
[8] (software available at
http://bioinformatics.psb.ugent.be/software/details/LeMoNe) identified 108
ensemble-averaged modules from 12 independent
Gibbs sampler runs, containing 1761 genes in to- 
tal. It inferred a ranked list of regulator-module 
edges from an ensemble of 10 regulatory programs 
per module with 100 regulator samples per regula- 
tory program node (see [8] for more details on the 
meaning of these parameters). We applied CLR [2] 
(software available at http://gardnerlab.bu.edu/clr.html)
on the data for the 2084 selected genes (the
union of the 1882 differentially expressed genes and 
316 candidate regulators) and kept all mutual infor- 
mation z-scores between the 316 transcription fac- 
tors and 1882 target genes. As a reference network 
we used RegulonDB version 5.7 [11], a database of 
4840 known transcriptional interactions in E. coli
between 167 transcription factors and 1693 genes. 
Recall values are computed with respect to Regu- 
lonDB restricted to the subset of 2084 genes. This 
subnetwork contains 3110 edges between 150 transcription
factors and 1053 genes. We used EcoCyc [18] to compute
functional enrichment of modules. Target and functional
enrichment in Table 1
were computed using a cumulative hypergeometric 
distribution, Bonferroni corrected for multiple test- 
ing, with confidence level 95%. 
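
The enrichment test described above can be sketched as follows. This is a minimal illustration, not the code used for Table 1: the function names, the toy counts and the number of tests used for the Bonferroni correction are all assumptions, and the 95% confidence level is read as a corrected p-value below 0.05.

```python
from math import comb


def hypergeom_tail(n_genes, n_annotated, module_size, overlap):
    """P(X >= overlap) when module_size genes are drawn without replacement
    from n_genes genes, n_annotated of which carry the annotation."""
    upper = min(n_annotated, module_size)
    total = comb(n_genes, module_size)
    favourable = sum(
        comb(n_annotated, i) * comb(n_genes - n_annotated, module_size - i)
        for i in range(overlap, upper + 1)
    )
    return favourable / total


def enrichment(n_genes, n_annotated, module_size, overlap, n_tests, alpha=0.05):
    """Return the Bonferroni-corrected p-value and whether it is below alpha.
    n_tests is the number of hypotheses tested (an assumption of this sketch)."""
    p = hypergeom_tail(n_genes, n_annotated, module_size, overlap)
    p_corrected = min(1.0, p * n_tests)
    return p_corrected, p_corrected < alpha


# Hypothetical example: a module of 20 genes, 10 of which are among the
# 30 known targets of a regulator, in a background of 2084 genes.
p_corr, significant = enrichment(2084, 30, 20, 10, n_tests=108)
```

The same function applies to functional enrichment by replacing the set of known targets with the set of genes annotated to a functional category.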

The S. cerevisiae microarray data compendium 
[10] contains expression profiles for 6153 genes in 173 
different stress conditions. We used the same sub- 
set of 2355 differentially expressed genes, including 
a list of 321 potential regulators, as used in previ- 
ous studies of this data set [3,8]. LeMoNe was run 
with the same settings as for E. coli and inferred 55 
ensemble-averaged modules containing 1075 genes.
As reference network we used a network recently 
compiled from the results of genetic, biochemical and 
ChIP-chip experiments [12]. It contains 11785 interactions
between 154 transcription factors and 4047
genes. After restriction to the subset of 2355 dif- 
ferentially expressed genes, it contains 4513 inter- 
actions between 133 transcription factors and 1628 
genes. The YEASTRACT [21] database contains 
30979 transcriptional interactions in S. cerevisiae 
between 171 transcription factors and 5727 genes. 
After restriction to the subset of 2355 differentially 
expressed genes, it contains 12021 interactions be- 
tween 137 transcription factors and 2182 genes. 
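
The restriction of a reference network to the selected genes, and the recall and precision of a ranked list of predictions against it, can be sketched as below. This is a hedged illustration with hypothetical names and toy edges. It assumes that an edge is kept only if both the regulator and the target are in the selected gene set, and that precision is taken over all predictions up to a given rank; the evaluation in the paper may differ in these details.

```python
def restrict(reference, genes):
    """Keep reference edges whose regulator and target are both selected."""
    return {(tf, target) for tf, target in reference
            if tf in genes and target in genes}


def recall_precision(ranked_edges, reference):
    """Return (recall, precision) after each prediction in ranked order."""
    if not reference:
        raise ValueError("reference network is empty")
    true_positives = 0
    curve = []
    for rank, edge in enumerate(ranked_edges, start=1):
        if edge in reference:
            true_positives += 1
        curve.append((true_positives / len(reference), true_positives / rank))
    return curve


# Toy data (illustrative only, not taken from RegulonDB or YEASTRACT).
reference = {("lexA", "recA"), ("lexA", "dinB"), ("fnr", "narG"), ("crp", "lacZ")}
selected = {"lexA", "recA", "dinB", "fnr", "narG"}
restricted = restrict(reference, selected)

ranked = [("lexA", "recA"), ("fnr", "dinB"), ("fnr", "narG")]
curve = recall_precision(ranked, restricted)
```

Because recall is divided by the size of the restricted reference network, an incomplete reference lowers the apparent precision of correct but undocumented predictions, which is the caveat raised in the text.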
Regulator | Module ID | Score   | Target enrich. | Autoreg. | Pathway | Local | Function
gatR_2    | 73        | 1912.98 | **             |          | *       | **    | carbon utilization > carbon compounds
gadE      | 48        | 1844.50 | **             | *        | *       | **    | adaptations > pH
gutM      | 38        | 1807.24 | **             | *        | *       | *     | carbon utilization > carbon compounds
ymfN      | 58        | 1749.11 |                |          |         | *     |
ymfN      | 33        | 1711.17 |                |          |         | *     |
fliA      | 12        | 1510.48 | **             | *        | *       | **    | motility, chemotaxis, energytaxis; flagella; biosynthesis of flagellum
rcsB      | 62        | 1261.72 |                |          | *       | *     | biosynthesis of colanic acid (M antigen)
fecI      | 57        | 1200.77 |                | *        | *       |       | adaptations > Fe acquisition
gatR_2    | 42        | 1176.55 | **             |          | *       | **    | carbon utilization > carbon compounds
yahA      | 82        | 1171.92 |                |          |         |       |
rcsA      | 87        | 1151.97 | **             | *        | *       |       | biosynthesis of colanic acid (M antigen)
lexA      | 20        | 996.62  | **             | *        | *       | *     | SOS response; DNA repair; protection > radiation
lldR      | 65        | 976.84  | **             | *        | *       | *     | energy metabolism; aerobic respiration
fliA      | 45        | 956.70  | **             | *        | *       |       | motility, chemotaxis, energytaxis
fliA      | 18        | 903.46  | *              | *        | *       |       | biosynthesis of flagellum; motility, chemotaxis, energytaxis; flagella
nac       | 85        | 827.17  |                | *        | *       |       | nitrogen metabolism
yiaG      | 15        | 816.55  |                |          |         |       |
ydaK      | 23        | 815.75  |                |          |         | **    |
ydaK      | 154       | 805.22  |                |          |         |       |
fnr       | 23        | 798.27  | *              | *        | *       | **    | energy metabolism; anaerobic respiration; membrane
lrp       | 5         | 777.80  |                | *        | *       |       | biosynthesis of building blocks > amino acids
araC      | 46        | 760.44  | **             | *        | *       | **    | carbon utilization > carbon compounds
appY      | 50        | 748.75  |                |          |         |       |
yfiE      | 67        | 736.50  |                |          |         |       |
osmE      | 15        | 734.87  |                |          |         |       |
lexA      | 78        | 726.67  | **             | *        | *       |       | SOS response
purR      | 144       | 708.63  |                | *        | *       |       |
uidR      | 81        | 708.36  |                | *        |         |       |
araC      | 21        | 678.10  | *              | *        | *       |       | carbon utilization > carbon compounds
yfeG      | 29        | 663.94  |                |          |         |       |
b1450     | 53        | 662.16  |                |          |         |       |
flhC      | 18        | 650.64  | **             |          | *       |       | biosynthesis of flagellum; motility, chemotaxis, energytaxis; flagella
ogrK      | 83        | 645.35  |                |          |         |       |
fliA      | 17        | 637.28  |                | *        |         |       |
rpoS      | 14        | 637.13  | **             |          | *       | *     | adaptations > osmotic pressure
pdhR      | 55        | 633.52  |                | *        | *       |       | energy metabolism; anaerobic respiration
tdcA      | 31        | 619.06  | *              | *        | *       | *     | threonine catabolism; carbon utilization > amino acids
yebK      | 106       | 617.44  |                |          |         |       |
araC      | 56        | 608.17  | **             | *        | *       |       | carbon utilization > carbon compounds
csgD      | 26        | 599.30  |                | *        |         |       |
hycA      | 66        | 596.27  |                |          |         |       |
tdcR      | 11        | 593.75  |                |          | *       |       | carbon utilization > amino acids
fliA      | 24        | 593.05  | *              | *        | *       |       | flagella; motility, chemotaxis, energytaxis; biosynthesis of flagellum
chbR      | 24        | 590.31  |                | *        |         |       |
hycA      | 29        | 563.45  |                |          |         | *     |
galS      | 76        | 561.25  | **             | *        | *       | **    | carbon utilization > carbon compounds
nlp       | 77        | 559.41  |                |          |         |       |
yfeC      | 119       | 549.33  |                |          |         |       |
b1506     | 36        | 548.33  |                |          |         |       |
lrp       | 10        | 528.90  | *              | *        | *       |       | biosynthesis of building blocks > amino acids
cspB      | 37        | 527.86  |                |          |         |       |
cusR      | 68        | 515.56  | **             | *        | *       | **    | extrachromosomal > transposon related
b1284     | 51        | 514.78  |                |          |         |       |
nanR      | 9         | 508.87  |                |          |         |       |
          | 90        | 496.21  |                |          |         |       |
lrp       | 126       | 493.60  |                | *        | *       |       | biosynthesis of building blocks > amino acids
yjjQ      | 179       | 491.02  |                |          |         |       |
yehV      | 63        | 483.29  |                |          |         |       |
ogrK      | 27        | 481.75  |                |          |         |       |
slyA      | 3         | 474.43  |                |          |         |       |
ydcN      | 16        | 467.66  |                |          |         |       |
cpxR      | 9         | 465.39  | *              | *        | *       |       | adaptations > other (mechanical, nutritional, oxidative stress)
yehV      | 34        | 451.77  |                |          |         |       |
fruR      | 63        | 449.25  |                |          |         |       |
araC      | 64        | 441.57  |                |          |         |       | carbon utilization > carbon compounds
fis       | 19        | 436.12  |                |          |         | *     | information transfer > RNA related > tRNA
fadR      | 16        | 435.98  |                |          |         |       |
purR      | 10        | 431.78  |                |          |         |       | biosynthesis of building blocks > nucleotides
cadC      | 37        | 429.32  |                | *        |         |       |
fecI      | 54        | 429.28  |                | *        |         |       |
rstA      | 102       | 428.94  |                |          |         |       |
tdcR      | 61        | 428.84  |                |          |         |       |
flhC      | 24        | 426.88  | **             |          | *       | *     | flagella; motility, chemotaxis, energytaxis; biosynthesis of flagellum



Table 1: Biological validation of the LeMoNe 30% precision network for E. coli. Target enrichment: (*) 
module is enriched in known targets of the predicted regulator, (**) module is most enriched for predicted 
regulator. Autoregulator: (*) regulator is an autoregulator. Pathway: (*) module is enriched in the same 
function(s) as the regulator. Local: (*) regulator is in the same operon as the module genes, (**) Transcrip- 
tion unit of regulator is adjacent to transcription units of the module genes. Function: enriched functions 
of the module. Regulators in bold face are putative regulators without known targets; module IDs in bold 
face consist only of uncharacterized genes. 
Supplementary information (1 table + 5 figures)



MODULE | REGULATOR | GENE        | EVIDENCE TYPE | REFERENCE
5      | Lrp       | gdhA        | ChIP-qPCR | Faith et al. (2007)
5      | Lrp       | hisA        | microarray analysis | Hung et al. (2002)
5      | Lrp       | hisB        | microarray analysis | Hung et al. (2002)
5      | Lrp       | hisC        | microarray analysis | Hung et al. (2002)
5      | Lrp       | hisD        | microarray analysis | Hung et al. (2002)
5      | Lrp       | hisF        | microarray analysis | Hung et al. (2002)
5      | Lrp       | hisG        | microarray analysis | Hung et al. (2002)
5      | Lrp       | hisH        | microarray analysis | Hung et al. (2002)
5      | Lrp       | hisI        | microarray analysis | Hung et al. (2002)
5      | Lrp       | pntA        | ChIP-qPCR | Faith et al. (2007)
5      | Lrp       | pntB        | ChIP-qPCR | Faith et al. (2007)
5      | Lrp       | ydiJ        | microarray analysis | Hung et al. (2002)
10     | Lrp       | aroG        | ChIP-qPCR | Faith et al. (2007)
10     | Lrp       | leuB        | microarray analysis (indirect influence of Lrp on expression of LeuLABCD according to Landgraf et al. (1999)) | Hung et al. (2002)
10     | Lrp       | pheA        | microarray analysis | Hung et al. (2002)
10     | Lrp       | pheL        | microarray analysis | Hung et al. (2002)
10     | Lrp       | purM        | ChIP-qPCR | Faith et al. (2007)
10     | Lrp       | thrA        | ChIP-qPCR | Faith et al. (2007)
10     | Lrp       | thrB        | ChIP-qPCR | Faith et al. (2007)
10     | Lrp       | thrC        | ChIP-qPCR | Faith et al. (2007)
10     | Lrp       | yagU        | ChIP-qPCR | Faith et al. (2007)
14     | RpoS      | amyA        | microarray analysis | Weber et al. (2005)
14     | RpoS      | b0753       | microarray analysis | Weber et al. (2005)
14     | RpoS      | b1953       | microarray analysis | Weber et al. (2005)
14     | RpoS      | b2080       | microarray analysis | Weber et al. (2005)
14     | RpoS      | b2086       | microarray analysis | Weber et al. (2005)
14     | RpoS      | psiF        | microarray analysis | Weber et al. (2005)
14     | RpoS      | wrbA        | microarray analysis | Weber et al. (2005)
14     | RpoS      | ybaT        | microarray analysis | Weber et al. (2005)
14     | RpoS      | ybaS        | microarray analysis | Weber et al. (2005)
14     | RpoS      | ybaY        | microarray analysis | Weber et al. (2005)
14     | RpoS      | ycaC        | microarray analysis | Weber et al. (2005)
14     | RpoS      | yccJ        | microarray analysis | Weber et al. (2005)
14     | RpoS      | ycgB        | microarray analysis | Weber et al. (2005)
14     | RpoS      | ycgZ        | microarray analysis | Weber et al. (2005)
14     | RpoS      | yeaG        | microarray analysis | Weber et al. (2005)
14     | RpoS      | yedU        | microarray analysis | Weber et al. (2005)
14     | RpoS      | ygjG        | microarray analysis | Weber et al. (2005)
14     | RpoS      | yhbO        | microarray analysis | Weber et al. (2005)
14     | RpoS      | ymgA        | microarray analysis | Weber et al. (2005)
       |           |             | microarray analysis | Weber et al. (2005)
14     | RpoS      | yphA        | microarray analysis | Weber et al. (2005)
19     | Fis       | argO        | microarray analysis | Bradley et al. (2007)
19     | Fis       | argV        | microarray analysis | Bradley et al. (2007)
19     | Fis       | argZ        | microarray analysis | Bradley et al. (2007)
19     | Fis       | dnlR        | microarray analysis | Bradley et al. (2007)
19     | Fis       | glyW        | microarray analysis | Bradley et al. (2007)
19     | Fis       | secG        | microarray analysis | Bradley et al. (2007)
20     | LexA      | dinD        | microarray analysis, ChIP-chip analysis | Courcelle et al. (2002), Wade et al. (2005)
20     | LexA      | dinI        | microarray analysis, ChIP-chip analysis | Courcelle et al. (2002), Wade et al. (2005)
20     | LexA      | yafN        | microarray analysis (co-upregulation of neighbouring ORFs, along with dinB) | Courcelle et al. (2002)
20     | LexA      | yafO        | microarray analysis (co-upregulation of neighbouring ORFs, along with dinB) | Courcelle et al. (2002)
20     | LexA      | yebF        | microarray analysis (co-upregulation of neighbouring ORFs, along with yebG) | Courcelle et al. (2002)
20     | LexA      | yebG        | |
20     | LexA      | yigN        | functional LexA-binding site (electrophoretic gel mobility shift assay), microarray analysis, ChIP-chip analysis | Fernandez de Henestrosa et al. (2000), Courcelle et al. (2002), Wade et al. (2005)
20     | LexA      | yjiW        | ChIP-chip analysis | Wade et al. (2005)
23     | FNR       | b1341       | ChIP-chip analysis | Grainger et al. (2007)
23     | FNR       | b1342       | ChIP-chip analysis | Grainger et al. (2007)
23     | FNR       | ydaA        | ChIP-chip analysis | Grainger et al. (2007)
24     | FliA      | b1742       | microarray analysis | Zhao et al. (2007)
24     | FliA      | b1760       | microarray analysis | Zhao et al. (2007)
45     | FliA      | yjcZ        | microarray-based genetic footprinting, microarray analysis | Girgis et al. (2007), Zhao et al. (2007)
45     | FliA      | yjdA        | microarray-based genetic footprinting, microarray analysis | Girgis et al. (2007), Zhao et al. (2007)
48     | GadE      | yhiD        | |
48     | GadE      | yhiU        | |
48     | GadE      | slp         | microarray analysis (induction by YdeO or GadE?) | Masuda & Church (2003)
48     | GadE      | yhiF        | microarray analysis (induction by YdeO or GadE?), macroarray analysis | Masuda & Church (2003), Hommais et al. (2004)
78     | LexA      | dinP (dinB) | microarray analysis, ChIP-chip analysis | Courcelle et al. (2002), Wade et al. (2005)
Table S1: New interactions predicted in the 30% precision LeMoNe network validated by literature search.
Figure S2: CLR network for E. coli at 30% precision cutoff.
Figure S3: LeMoNe network for S. cerevisiae at first 1070 predictions.






Figure S4: CLR network for S. cerevisiae at first 1070 predictions.
Figure S5: Expression levels of the genes in modules 12, 18, 24 and 45 for E. coli, with genes sorted in the
same order as in Figure 5 (b). Conditions are sorted by the mean expression over all genes and yellow lines 
indicate module boundaries. 